optionally use MADV_GUARD_INSTALL for large allocation guard pages#341

Open
thomasbuilds wants to merge 1 commit into GrapheneOS:main from thomasbuilds:madvise-guard-install

thomasbuilds commented May 29, 2026

Addresses the high-VMA-count concern from KERNEL_FEATURE_WISHLIST.md (see #258). MADV_GUARD_INSTALL (Linux 6.13+) lets guard regions live inside a single read-write mapping at the page-table level instead of as separate PROT_NONE VMAs.

Change

Adds CONFIG_GUARD_PAGES_USE_MADVISE (default false). When enabled, guard regions for large allocations are installed with MADV_GUARD_INSTALL inside one read-write mapping rather than carved out as separate PROT_NONE mappings, keeping each large allocation to a single VMA instead of three. This is applied in allocate_pages(), allocate_pages_aligned(), the region quarantine, and the in-place realloc shrink, so the single-VMA property holds under allocation churn rather than only for live allocations. It also holds across the mremap growth path: guard markers move with the mapping and the moved body merges into the never-faulted destination fragments (verified on 6.17). For aligned allocations the installed guards sit exactly adjacent to the usable region, giving the clean [guard][usable][guard] layout discussed in #350.
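
The single-mapping layout described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `map_guarded` is a hypothetical helper, the `mprotect` fallback is a simplified stand-in for the existing PROT_NONE scheme, and the `#ifndef` supplies the Linux UAPI value of `MADV_GUARD_INSTALL` for libc headers that predate 6.13.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Provided by Linux 6.13+ UAPI headers; older libc headers may lack it. */
#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102
#endif

/*
 * Hypothetical helper: map [guard][usable][guard] and return the usable
 * pointer, or NULL with errno set. Both sizes must be page multiples.
 */
void *map_guarded(size_t usable_size, size_t guard_size) {
    size_t total = usable_size + 2 * guard_size;
    char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        return NULL;
    }
    char *usable = base + guard_size;
    char *upper = usable + usable_size;

    /* Preferred path: guard markers inside the one read-write VMA. */
    if (madvise(base, guard_size, MADV_GUARD_INSTALL) == 0 &&
        madvise(upper, guard_size, MADV_GUARD_INSTALL) == 0) {
        return usable;
    }

    /* Fallback (assumption): separate PROT_NONE VMAs around the usable part. */
    if (mprotect(base, guard_size, PROT_NONE) == 0 &&
        mprotect(upper, guard_size, PROT_NONE) == 0) {
        return usable;
    }

    int saved_errno = errno;
    munmap(base, total);
    errno = saved_errno;
    return NULL;
}
```

On a kernel with guard support the success path is one mmap plus two madvise calls, and the whole range stays one VMA; on older kernels the sketch ends up with the three-VMA layout.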

Syscall cost: MADV_GUARD_INSTALL zaps any existing pages in the range, so no separate MADV_DONTNEED is needed and the quarantine and the shrink path stay at one syscall each (one madvise instead of one mmap). Only allocation pays one extra syscall (mmap + 2x madvise instead of mmap + mprotect).

Kernel support is probed once at runtime on a fresh mapping and cached. Guard installation is best-effort: any madvise failure falls back to the existing PROT_NONE scheme rather than failing the allocation, and errno is preserved. MADV_GUARD_INSTALL returns EINVAL on VM_LOCKED mappings, so an EINVAL resets the cached state and the next allocation re-probes. Under mlockall(MCL_FUTURE) the probe mapping is itself locked, so the feature latches off rather than being retried per allocation; freeing a one-off mlock'd allocation only loses that single call. Under CONFIG_LABEL_MEMORY the quarantined region is labeled as a whole so PR_SET_VMA_ANON_NAME does not split the single VMA back into three.
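
A rough sketch of the probe-and-cache logic with the EINVAL reset, under stated assumptions: `guard_state`, `probe_guard_install` and `install_guard` are hypothetical names, and `mprotect(PROT_NONE)` stands in for whatever the existing PROT_NONE scheme does in each call site.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102 /* Linux UAPI value, 6.13+ */
#endif

enum { GUARD_UNKNOWN, GUARD_SUPPORTED, GUARD_UNSUPPORTED };

/* The only cross-thread state: a single atomic flag. */
static _Atomic int guard_state = GUARD_UNKNOWN;

/* Probe on a fresh one-page mapping so earlier state cannot interfere. */
static int probe_guard_install(void) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return GUARD_UNKNOWN; /* could not probe; try again next time */
    }
    /* Under mlockall(MCL_FUTURE) this mapping is locked and madvise fails. */
    int ok = madvise(p, page, MADV_GUARD_INSTALL) == 0;
    munmap(p, page);
    return ok ? GUARD_SUPPORTED : GUARD_UNSUPPORTED;
}

/* Guard [addr, addr + len). Returns 0 on success; errno is preserved then. */
int install_guard(void *addr, size_t len) {
    int saved_errno = errno;
    int state = atomic_load(&guard_state);
    if (state == GUARD_UNKNOWN) {
        state = probe_guard_install();
        atomic_store(&guard_state, state);
    }
    if (state == GUARD_SUPPORTED) {
        if (madvise(addr, len, MADV_GUARD_INSTALL) == 0) {
            errno = saved_errno;
            return 0;
        }
        if (errno == EINVAL) {
            /* This mapping cannot be guarded (e.g. VM_LOCKED): re-probe later. */
            atomic_store(&guard_state, GUARD_UNKNOWN);
        }
    }
    /* Best-effort fallback (assumption): plain PROT_NONE protection. */
    int ret = mprotect(addr, len, PROT_NONE);
    if (ret == 0) {
        errno = saved_errno;
    }
    return ret;
}
```

The race on `guard_state` is benign in this sketch: two threads may both probe, but they store the same answer, and a stale value only costs one failed madvise before the fallback.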

One sharp edge is documented rather than fixed: guard install is not atomic, so if it fails partway through the realloc-shrink path and the PROT_NONE fallback also fails (two ENOMEMs back to back), part of the discarded tail may be left guarded or zapped while realloc returns NULL. Failing loudly on a later access is preferred over a MADV_GUARD_REMOVE recovery that would silently expose zeroed pages.

Why off by default

In #258 it was noted this would "require having full overcommit enabled if it doesn't reduce the accounted memory", and that is what I measured. Resident memory and total address space are unchanged (RLIMIT_AS unaffected), but private-writable commit charge grows, which regresses strict overcommit (vm.overcommit_memory=2):

  • live allocations: commit grows by the guard size (~260 MiB for 2000 256 KiB allocations);
  • the quarantine dominates: quarantined regions stay committed (~1.9 GiB at default quarantine settings under sustained 1 MiB churn, vs 0 for the PROT_NONE scheme). RSS is still released because guard install zaps the pages (measured at +184 KiB resident after 2560 quarantined 1 MiB frees).
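
For readers who want to reproduce this kind of number, the fields involved can be read from /proc/self/status. The helper below is a hypothetical sketch, not the measurement harness used for this PR; `read_status_kib` is an assumed name.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return the value in KiB of a /proc/self/status field such as "VmData",
 * "VmRSS" or "VmSize", or -1 if the field is missing or the file is unreadable.
 */
long read_status_kib(const char *field) {
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL) {
        return -1;
    }
    size_t field_len = strlen(field);
    char line[256];
    long value = -1;
    while (fgets(line, sizeof line, f) != NULL) {
        /* Lines look like "VmData:\t    1234 kB". */
        if (strncmp(line, field, field_len) == 0 && line[field_len] == ':') {
            value = strtol(line + field_len + 1, NULL, 10);
            break;
        }
    }
    fclose(f);
    return value;
}
```

Sampling `read_status_kib("VmData")` and `read_status_kib("VmRSS")` before and after a batch of large allocations gives deltas comparable in kind to the ones quoted here.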

There is also a throughput cost on allocation-rate-bound workloads: guard installation writes a page-table marker for every 4 KiB page and allocates page tables for the guard range, so its cost scales with the randomized guard size, while a PROT_NONE reservation populates no page tables at all. Measured: throughput is about 2% lower for single-threaded churn with pages touched, 26% lower for pure alloc/free of 256 KiB allocations, about 2x slower in an 8-thread 256 KiB churn stress test (medians of 9 interleaved runs), and several times slower when churning allocations in the tens of MiB, where guards span thousands of pages. Measured TLB shootdown IPIs are slightly lower than with the PROT_NONE scheme, so the cost is in-kernel page-table work rather than interrupt traffic. The win is in whole-process operations that scale with VMA count, which is the actual motivation (see below). Hence opt-in rather than a default behavior change, following the CONFIG_LABEL_MEMORY precedent of a compile-time option defaulting to false.

Measurements

Linux 6.17 x86_64, 8 cores. 2000 concurrently-live 256 KiB allocations, all pages touched:

| metric | PROT_NONE | MADV_GUARD_INSTALL |
| --- | --- | --- |
| VMAs | +4007 | +9 |
| VmRSS | +512228 KiB | +512216 KiB |
| VmSize (RLIMIT_AS) | +782448 KiB | +775368 KiB |
| VmData (committed) | +512160 KiB | +775528 KiB |

Adjacent single-VMA allocations merge, so it does better than 1 VMA/allocation. VMAs after sustained churn (2560 x 1 MiB alloc/free, full quarantine):

| config | PROT_NONE | MADV_GUARD_INSTALL |
| --- | --- | --- |
| default (no labeling) | +542 | +535 |
| CONFIG_LABEL_MEMORY | +3094 | +551 |

That's a ~5.6x reduction under CONFIG_LABEL_MEMORY (the Android default), and <= PROT_NONE in every config. (Counts vary with the randomized guard sizes.) Whole-process operations with 2000 live allocations: the /proc/self/smaps payload drops from 2844 KiB to 46 KiB, so code that does work per VMA scans far less, and VMA-dominated fork() latency drops ~32%.
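
VMA counts like these can be obtained by counting lines in /proc/self/maps, since each line describes one VMA. The sketch below is an assumed way to do that, not the PR's own tooling; `count_vmas` and `vmas_added_by_prot_none_guards` are hypothetical names, and the second function only demonstrates the split caused by PROT_NONE guards.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Count VMAs of the calling process: one line of /proc/self/maps per VMA. */
long count_vmas(void) {
    FILE *f = fopen("/proc/self/maps", "r");
    if (f == NULL) {
        return -1;
    }
    long lines = 0;
    int c;
    while ((c = fgetc(f)) != EOF) {
        if (c == '\n') {
            lines++;
        }
    }
    fclose(f);
    return lines;
}

/*
 * Map [guard][usable][guard] with PROT_NONE guards and report how many VMAs
 * that added. Returns -1 on error. Sizes must be page multiples.
 */
long vmas_added_by_prot_none_guards(size_t usable_size, size_t guard_size) {
    count_vmas(); /* warm-up so stdio's own allocations are already mapped */
    long before = count_vmas();
    size_t total = usable_size + 2 * guard_size;
    char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (before < 0 || base == MAP_FAILED) {
        return -1;
    }
    long delta = -1;
    if (mprotect(base, guard_size, PROT_NONE) == 0 &&
        mprotect(base + guard_size + usable_size, guard_size, PROT_NONE) == 0) {
        delta = count_vmas() - before;
    }
    munmap(base, total);
    return delta;
}
```

The exact delta depends on whether the guards merge with neighbouring mappings, which is also why the counts in the tables vary from run to run.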

Verification

  • Builds clean with -Werror under gcc and clang, feature off and on, with and without CONFIG_LABEL_MEMORY; the CI matrix (gcc, clang, musl) now also runs the test suite with the feature enabled. All 56 tests pass in every configuration.
  • Five new regression tests cover large-allocation underflow, aligned-allocation overflow/underflow, and the realloc-shrink guard and discarded-tail paths; they assert SIGSEGV under both guard schemes, on any kernel.
  • On a real 6.13+ kernel: guards fault on overflow, underflow, use-after-free (quarantine), and after in-place realloc shrink, with and without CONFIG_LABEL_MEMORY; quarantined, shrunk and mremap-grown regions stay single-VMA.
  • mlockall(MCL_FUTURE) latches the feature off via the probe and all allocations succeed on the PROT_NONE scheme; freeing an mlock'd allocation falls back for that call only.
  • Failure paths are exercised directly by injecting madvise faults with strace: with every call failing ENOMEM, failing from the 7th call onward, and failing EINVAL intermittently (forcing repeated re-probes and mixed-scheme allocations), the full suite passes and guards still fault through the fallbacks.
  • UBSan clean (suite + churn + realloc shrink/grow). A 2-minute 8-thread randomized stress (malloc/memalign/realloc/free with per-allocation pattern verification, plus mlock'd frees racing the probe) completes ~230k operations with no corruption; the only cross-thread state is the single atomic feature flag.
  • The probe trusts madvise's return value, so the feature must be validated on a real kernel: qemu-user silently no-ops MADV_GUARD_INSTALL, which would leave large allocations without guards. This is a reason it must stay opt-in.
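
The "assert SIGSEGV" style of check used by the regression tests can be sketched as below. This is an illustrative stand-in rather than the tests added by this PR: it guards a raw mapping with `mprotect(PROT_NONE)` instead of going through the allocator, and `write_faults` and `guard_layout_faults` are hypothetical names.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* True if writing one byte at addr kills a forked child with SIGSEGV. */
bool write_faults(volatile char *addr) {
    pid_t pid = fork();
    if (pid < 0) {
        return false;
    }
    if (pid == 0) {
        *addr = 1; /* expected to fault when addr is inside a guard */
        _exit(0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        return false;
    }
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
}

/* Build [guard][usable][guard] and check underflow, overflow and normal use. */
bool guard_layout_faults(void) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *base = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        return false;
    }
    bool ok = mprotect(base, page, PROT_NONE) == 0 &&
              mprotect(base + 2 * page, page, PROT_NONE) == 0 &&
              write_faults(base + page - 1) &&  /* underflow into lower guard */
              write_faults(base + 2 * page) &&  /* overflow into upper guard */
              !write_faults(base + page);       /* usable region stays writable */
    munmap(base, 3 * page);
    return ok;
}
```

Running the faulting access in a child keeps the test process alive and lets the parent assert on the exact terminating signal.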

rdevshp commented May 30, 2026

```c
#define _GNU_SOURCE

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <sys/mman.h>

int main(void) {
    const size_t size = 256 * 1024;

    errno = 0;
    void *warm = malloc(size);
    if (warm == NULL) {
        printf("warmup_large_malloc=failed errno=%d (%s)\n", errno, strerror(errno));
        return 2;
    }
    printf("warmup_large_malloc=ok ptr=%p\n", warm);

    errno = 0;
    int lock_ret = mlockall(MCL_FUTURE | MCL_ONFAULT);
    printf("mlockall_mcl_future_ret=%d errno=%d (%s)\n", lock_ret, errno, strerror(errno));

    errno = 0;
    void *after = malloc(size);
    if (after == NULL) {
        printf("post_mlock_large_malloc=failed errno=%d (%s)\n", errno, strerror(errno));
        return 1;
    }

    printf("post_mlock_large_malloc=ok ptr=%p\n", after);
    return 0;
}
```

This PR introduces a regression for the program above. With CONFIG_GUARD_PAGES_USE_MADVISE set to false the program runs normally, but with it set to true the malloc after mlockall(MCL_FUTURE | MCL_ONFAULT) fails with errno=22 (EINVAL).

@thomasbuilds thomasbuilds marked this pull request as draft May 30, 2026 19:21
@thomasbuilds thomasbuilds force-pushed the madvise-guard-install branch from 9e3e3a6 to f54ee16 Compare May 30, 2026 19:43
@thomasbuilds thomasbuilds marked this pull request as ready for review June 6, 2026 08:36
@thomasbuilds thomasbuilds force-pushed the madvise-guard-install branch 2 times, most recently from ded5838 to 35a0009 Compare June 6, 2026 12:08
Add CONFIG_GUARD_PAGES_USE_MADVISE (default false) to install the
guard regions of large allocations with MADV_GUARD_INSTALL (Linux
6.13+) inside a single read-write mapping instead of as separate
PROT_NONE mappings, keeping each large allocation to one VMA instead
of three. The single-VMA property is preserved through
allocate_pages(), allocate_pages_aligned(), the region quarantine and
the in-place realloc shrink so it holds under allocation churn,
including under CONFIG_LABEL_MEMORY where the quarantined region is
named as a whole to avoid splitting the VMA. Guard install zaps any
existing pages in the range, so the quarantine still purges data and
frees resident memory with a single system call, the same count as
the PROT_NONE remap it replaces; allocation costs one extra system
call (mmap + 2 madvise instead of mmap + mprotect).

Kernel support is probed on a fresh mapping at runtime and cached.
Guard installation is best-effort: any madvise failure falls back to
the PROT_NONE scheme. EINVAL means the specific mapping can't be
guarded (VM_LOCKED), so it resets the cached state to force a
re-probe: under mlockall(MCL_FUTURE) the probe mapping is itself
locked and latches the feature off, while freeing a one-off mlock'd
allocation only loses the single call. errno is preserved across the
fallback.

It is off by default because the guard bytes and quarantined regions
are then accounted as committed memory (resident memory and total
address space are unchanged), which regresses strict overcommit
(vm.overcommit_memory=2).

Add large allocation guard regression tests covering underflow,
aligned allocation overflow/underflow and the in-place realloc shrink
paths, which apply to both guard schemes, and build the new
configuration in CI.
@thomasbuilds thomasbuilds force-pushed the madvise-guard-install branch from 35a0009 to 6252879 Compare June 10, 2026 10:34
@thomasbuilds

Thanks @rdevshp, the PR got updated quite a lot since your last review.
